1 OVERVIEW

In this homework assignment, we will explore, analyze and model a data set containing approximately 2200 records. Each record represents a professional baseball team from the years 1871 to 2006 inclusive. Each record has the performance of the team for the given year, with all of the statistics adjusted to match the performance of a 162 game season.

1.1 Objective:

To build a multiple linear regression model on the training data to predict the number of wins for the team. We can only use the variables provided (or variables that we will derive from the variables provided).

2 DATA EXPLORATION

2.1 Data Summary

The dataset consists of two data files: training and evaluation. The training dataset contains 17 columns, while the evaluation dataset contains 16. The evaluation dataset is missing column TARGET_WINS. We will start by exploring the training data set since it will be the one used to generate the regression model.

First we see that all data is numeric.

An important aspect of any dataset is to determine how much, if any, data is missing. We look at all the variables to see which if any have missing data. We look at the percentages of the data that are missing:

vars n mean sd median trimmed mad min max range skew kurtosis se na_count na_count_perc
TARGET_WINS 2 191 80.92670 12.115013 82 81.11765 13.3434 43 116 73 -0.1698314 -0.2952783 0.8766116 0 0.0
TEAM_BATTING_H 3 191 1478.62827 76.147869 1477 1477.42484 74.1300 1308 1667 359 0.1302702 -0.3710350 5.5098664 0 0.0
TEAM_BATTING_2B 4 191 297.19895 26.329335 296 296.62745 25.2042 201 373 172 0.0915189 0.4778716 1.9051238 0 0.0
TEAM_BATTING_3B 5 191 30.74346 9.043878 29 30.13072 8.8956 12 61 49 0.7007420 0.7446217 0.6543921 0 0.0
TEAM_BATTING_HR 6 191 178.05236 32.413243 175 176.81046 35.5824 116 260 144 0.2980673 -0.7172373 2.3453399 0 0.0
TEAM_BATTING_BB 7 191 543.31937 74.842133 535 541.31373 74.1300 365 775 410 0.3115199 -0.1474175 5.4153867 0 0.0
TEAM_BATTING_SO 8 191 1051.02618 104.156382 1050 1046.95425 97.8516 805 1399 594 0.3985050 0.3955105 7.5364913 102 4.5
TEAM_BASERUN_SB 9 191 90.90576 29.916401 87 89.06536 29.6520 31 177 146 0.5553966 -0.1414909 2.1646748 131 5.8
TEAM_BASERUN_CS 10 191 39.94241 11.898334 38 39.49020 11.8608 12 74 62 0.3468509 0.0006392 0.8609332 772 33.9
TEAM_BATTING_HBP 11 191 59.35602 12.967123 58 58.86275 11.8608 29 95 66 0.3185754 -0.1119828 0.9382681 2085 91.6
TEAM_PITCHING_H 12 191 1479.70157 75.788625 1480 1478.50327 72.6474 1312 1667 355 0.1279056 -0.3894781 5.4838725 0 0.0
TEAM_PITCHING_HR 13 191 178.17801 32.391678 175 176.93464 35.5824 116 260 144 0.2989191 -0.7190905 2.3437795 0 0.0
TEAM_PITCHING_BB 14 191 543.71728 74.916681 537 541.74510 72.6474 367 775 408 0.3144366 -0.1338563 5.4207808 0 0.0
TEAM_PITCHING_SO 15 191 1051.81675 104.347208 1052 1047.80392 97.8516 805 1399 594 0.3945586 0.3903991 7.5502990 102 4.5
TEAM_FIELDING_E 16 191 107.05236 16.632162 106 106.58170 17.7912 65 145 80 0.1780432 -0.3567367 1.2034610 0 0.0
TEAM_FIELDING_DP 17 191 152.33508 17.611682 152 152.04575 19.2738 113 204 91 0.2164822 -0.2115741 1.2743366 286 12.6

From this result we can see how several variables have a number of missing values. The maximum number of missing values was 2085 in the TEAM_BATTING_HBP variable. This is a significant amount of missing data representing 91.6% of that data.

With missing data asseced, we can look into descriptive statistics in more detail. Interestingly we find that the difference between means and medians is fairly small for all data columns. The maximum difference is in fact only 4.77%. This means that we are to expect the distributions of this data to be fairly uniform. To visualize this we plot histograms for each data.

The plot of distributions does show fairly uniform data, but it also show the potential precense of outliers in at least two of the predictors. This is not the best way to vizualise ouliers. Instead we identify the predictors which seem to have outliers by looking at the scattered and box plots. Two variables with outliers appera to be TEAM_PITCHING_H, TEAM_PITCHING_SO, TEAM_PITCHING_BB and TEAM_FIELDING_E. We highlight these variables from the desity plots since we can see most of the data concentrated at the lower end of the scales which show tailing off to high values.

TEAM_PITCHING_H

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1137    1419    1518    1779    1682   30132

From the summary for this variable we see that the maximum is 3.013210^{4}, which is substantially far from the median of 1518. We can also see a wide spread between mean and median of 261. But we know the maximum is not a realistic number. We can compare the numbers of pitching hits, to the number of batting hits, summary for which is shown below:

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     891    1383    1454    1469    1537    2554

For batting the maximum is 2554, much lower than 3.013210^{4}. We expect these two variables to somewhat equal values since one is the reciprocal of the other. As a saminity check of the distribution of the batting variable, we now see the mean and median much closer together with a spread of only 15.

TEAM_PITCHING_SO

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##     0.0   615.0   813.5   817.7   968.0 19278.0     102

Again looking at the maximum we find it is unreasonably high at 1.927810^{4}, which again is substantially far from the median of 813.5. Again this variable has a reciprocal in TEAM_BATTING_SO. Comparing agaisnt it confirms the outliers.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##     0.0   548.0   750.0   735.6   930.0  1399.0     102

We see that or batting the maximum is 1399, much lower than 1.927810^{4}. As before as a saminity check of the distribution of the batting variable, we now see the mean and median much closer together with a spread of only 14.4.

TEAM_PITCHING_BB

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     0.0   476.0   536.5   553.0   611.0  3645.0

Again looking at the maximum we find it is unreasonably high at 3645, which again is substantially far from the median of 536.5. Again this variable has a reciprocal in TEAM_BATTING_SO. Comparing agaisnt it confirms the outliers.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     0.0   451.0   512.0   501.6   580.0   878.0

We see that or batting the maximum is 878, much lower than 3645. As before as a saminity check of the distribution of the batting variable, we now see the mean and median much closer together with a spread of only 10.4.

TEAM_FIELDING_E

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    65.0   127.0   159.0   246.5   249.2  1898.0

Again looking at the maximum we find it is unreasonably high at 1898, which again is substantially far from the median of 159.

2.2 Correlation Plot

Looking at correlation of variables to number of wins provides some interesting data. We find some correlations that make sense from what might assume with subject knowledge of base, e.g., the number of hits and number of variables both have significant positive correlation with Wins and other statistics like stolen bases, while still positive, are not so strongly related. What is surprising though, are the pitching statistics. We would assume that a team that allowed the opposing team more hits, would lose more games (and win less), but that is not what the data shows us. Perhaps there are outliers swaying the correlation.

Regardless, we can use some of these correlations to drive initial models later, in terms of likely fields to choose for an effective model.

##      TARGET_WINS TEAM_BATTING_H TEAM_BATTING_2B TEAM_BATTING_3B
## [1,]           1      0.3887675       0.2891036       0.1426084
##      TEAM_BATTING_HR TEAM_BATTING_BB TEAM_BATTING_SO TEAM_BASERUN_SB
## [1,]       0.1761532       0.2325599              NA              NA
##      TEAM_BASERUN_CS TEAM_BATTING_HBP TEAM_PITCHING_H TEAM_PITCHING_HR
## [1,]              NA               NA      -0.1099371        0.1890137
##      TEAM_PITCHING_BB TEAM_PITCHING_SO TEAM_FIELDING_E TEAM_FIELDING_DP
## [1,]        0.1241745               NA      -0.1764848               NA

2.3 Missing and Invalid Data

Are any of the variables missing and need to be imputed “fixed”?

3 DATA PREPARATION

3.1 New Variable Creation / Removal (Due to Collinearity and/or Missing Values)

First task under data preparation will be to eliminate all missing data. In the Data Exploration section we found one variable, TEAM_BATTING_HBP with an exceptionaly high percentage of missing data, so we commence by eliminating this variable. We also remove the “INDEX” column as that is not used.

Next task is to handle missing data in the other variables. Here, becouse the percentages of missing data are lower, we can replace missing data with the median. We prefer replacing with median instead of mean because the latter is more sensitive to outliers. So we get a clean dataset without missing values.

Note we also consider zeros to be missing data. Since each row is a season of data for a given baseball team, it would be extraordinarily unlikely that any of these statistics would have zero as an actual value. Therefore we are assuming zero is another indicator of missing value and we will transform them into a median value.

In the exploratory phase we also identified several variables with outliers. Outliers will be substituted with median. Again we choose median becouse it is less influenced by these outliers. What cutoff to use to tag an outlier reading could be a 3 standard deviation from the mean, or 1.5 time the inter quartile range, but in this case becouse these variables have reciprocals as seen in the exploratory phase, we will use the maximum reading of those variables.

TEAM_PITCHING_H

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1137    1419    1518    1562    1636    2544

From the summary we now see that the maximum is 2544, which is a much more reasonable number. We can also see a wide spread between mean and median of 44, indicating a more normal distribution than before.

TEAM_PITCHING_SO

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   181.0   633.0   816.0   796.8   948.2  1399.0

From the summary we now see that the maximum is 1399, which is a much more reasonable number. We can also see a wide spread between mean and median of -19, indicating a more normal distribution than before.

Lastly, before we create models, lets divide data into test and training sets, with 80% for training, 20% for test. This way we have a method to validate our models.

4 BUILD MODELS

Batting only model

Combine all batting variables.

## 
## Call:
## lm(formula = TARGET_WINS ~ TEAM_BATTING_H, data = train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -68.157  -8.688   0.679   9.599  45.949 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    17.472589   3.369651   5.185  2.4e-07 ***
## TEAM_BATTING_H  0.043178   0.002281  18.927  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 14.4 on 1818 degrees of freedom
## Multiple R-squared:  0.1646, Adjusted R-squared:  0.1642 
## F-statistic: 358.2 on 1 and 1818 DF,  p-value: < 2.2e-16

Pitching only model

Combine all pitching variables.

## 
## Call:
## lm(formula = TARGET_WINS ~ TEAM_PITCHING_H + TEAM_PITCHING_HR, 
##     data = train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -68.379  -9.340   0.787   9.917  67.847 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      52.476611   2.691696  19.496  < 2e-16 ***
## TEAM_PITCHING_H   0.015148   0.001626   9.319  < 2e-16 ***
## TEAM_PITCHING_HR  0.045438   0.005863   7.750 1.51e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 15.2 on 1817 degrees of freedom
## Multiple R-squared:  0.06883,    Adjusted R-squared:  0.06781 
## F-statistic: 67.16 on 2 and 1817 DF,  p-value: < 2.2e-16

## 
## Call:
## lm(formula = TARGET_WINS ~ HITS_NOHR)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -67.111  -8.461   0.808  10.497  42.679 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 37.86903    3.00594   12.60   <2e-16 ***
## HITS_NOHR    0.03144    0.00218   14.42   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 14.92 on 1818 degrees of freedom
## Multiple R-squared:  0.1027, Adjusted R-squared:  0.1022 
## F-statistic:   208 on 1 and 1818 DF,  p-value: < 2.2e-16

## 
## Call:
## lm(formula = TARGET_WINS ~ HITS_NOHR + TEAM_BATTING_BB + TEAM_FIELDING_E, 
##     data = train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -54.256  -9.176   0.192   9.555  53.546 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     11.419903   3.533612   3.232  0.00125 ** 
## HITS_NOHR        0.047475   0.002322  20.445  < 2e-16 ***
## TEAM_BATTING_BB  0.018140   0.003524   5.148 2.92e-07 ***
## TEAM_FIELDING_E -0.018679   0.002141  -8.723  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 13.94 on 1816 degrees of freedom
## Multiple R-squared:  0.2172, Adjusted R-squared:  0.2159 
## F-statistic:   168 on 3 and 1816 DF,  p-value: < 2.2e-16
## Warning in abline(hitsNoHR_bb_e_mod): only using the first two of 4
## regression coefficients

Best in terms of residuals and Rsquared Hits, BB, and Fielding Errors. Plost look good except for short tailed issues in the QQ plot.

## 
## Call:
## lm(formula = TARGET_WINS ~ TEAM_BATTING_H + TEAM_BATTING_BB + 
##     TEAM_FIELDING_E, data = train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -51.767  -8.959   0.003   9.071  50.788 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      4.277272   3.639218   1.175 0.240018    
## TEAM_BATTING_H   0.050269   0.002305  21.813  < 2e-16 ***
## TEAM_BATTING_BB  0.012509   0.003507   3.567 0.000371 ***
## TEAM_FIELDING_E -0.014193   0.002022  -7.018 3.17e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 13.77 on 1816 degrees of freedom
## Multiple R-squared:  0.237,  Adjusted R-squared:  0.2357 
## F-statistic:   188 on 3 and 1816 DF,  p-value: < 2.2e-16
## Warning in abline(hits_bb_e_mod): only using the first two of 4 regression
## coefficients

Attempt at boxcox, didn’t achieve better results in R squared (not SE not directly comparable due to adjustment with boxcox) The QQ plot seems to look a bit better as the negative quantiles are much closer to the line. Box Cox

## 
## Call:
## lm(formula = TARGET_WINS_BC ~ TEAM_BATTING_H + TEAM_BATTING_BB + 
##     TEAM_FIELDING_E)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -194.979  -38.736   -1.184   37.665  223.257 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     -64.017547  15.331986  -4.175 3.12e-05 ***
## TEAM_BATTING_H    0.211530   0.009709  21.787  < 2e-16 ***
## TEAM_BATTING_BB   0.053146   0.014776   3.597 0.000331 ***
## TEAM_FIELDING_E  -0.051346   0.008520  -6.026 2.03e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 57.99 on 1816 degrees of freedom
## Multiple R-squared:  0.2317, Adjusted R-squared:  0.2305 
## F-statistic: 182.6 on 3 and 1816 DF,  p-value: < 2.2e-16

5 SELECT MODELS

5.1 Compare Key Statistics

5.2 Pick the top two

5.3 Pick the “best” model

5.4 Apply to the evaluation data

5.5 Conclusion

6 APPENDIX